Superstore Dataset Analysis - NTI Project
----------------------------------------------
(1) Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
rich: Provides enhanced console output with colors and formatting (optional)
from rich import print
To ignore warnings
import warnings
warnings.filterwarnings("ignore")
______________________________________________
(2) Data Preparation
# Load Excel file
df = pd.read_excel("Cleaned Sample - NTI Project (Python) - Superstore excel.xlsx")
pd.set_option('display.max_columns',None)
df.head(10)
| Order ID | Order Date | Ship Date | Difference (Days) | Ship Mode | Customer ID | Customer Name | Segment | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Price Per Unit | Quantity | Discounted? | Discount | Sales | Profit | Cost | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA-2014-103800 | 2014-01-03 | 2014-01-07 | 4 | Standard Class | DP-13000 | Darren Powers | Consumer | Seattle | Texas | 77095 | Central | OFF-PA-10000174 | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 10.281250 | 2 | YES | 0.2 | 16.45 | 5.55 | 20.5625 |
| 1 | CA-2014-112326 | 2014-01-04 | 2014-01-08 | 4 | Standard Class | PO-19195 | Phillina Ober | Home Office | San Diego | Illinois | 60540 | Central | OFF-LA-10003223 | Office Supplies | Labels | Avery 508 | 4.908333 | 3 | YES | 0.2 | 11.78 | 4.27 | 14.7250 |
| 2 | CA-2014-112326 | 2014-01-04 | 2014-01-08 | 4 | Standard Class | PO-19195 | Phillina Ober | Home Office | San Diego | Illinois | 60540 | Central | OFF-ST-10002743 | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 113.641667 | 3 | YES | 0.2 | 272.74 | -64.77 | 340.9250 |
| 3 | CA-2014-112326 | 2014-01-04 | 2014-01-08 | 4 | Standard Class | PO-19195 | Phillina Ober | Home Office | Evanston | Illinois | 60540 | Central | OFF-BI-10004094 | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 8.850000 | 2 | YES | 0.8 | 3.54 | -5.49 | 17.7000 |
| 4 | CA-2014-141817 | 2014-01-05 | 2014-01-12 | 7 | Standard Class | MB-18085 | Mick Brown | Consumer | Baltimore | Pennsylvania | 19143 | East | OFF-AR-10003478 | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 8.141667 | 3 | YES | 0.2 | 19.54 | 4.88 | 24.4250 |
| 5 | CA-2014-130813 | 2014-01-06 | 2014-01-08 | 2 | Second Class | LS-17230 | Lycoris Saunders | Consumer | Akron | California | 90049 | West | OFF-PA-10002005 | Office Supplies | Paper | Xerox 225 | 6.480000 | 3 | NO | 0.0 | 19.44 | 9.33 | 19.4400 |
| 6 | CA-2014-106054 | 2014-01-06 | 2014-01-07 | 1 | First Class | JO-15145 | Jack O'Briant | Corporate | San Jose | Georgia | 30605 | South | OFF-AR-10002399 | Office Supplies | Art | Dixon Prang Watercolor Pencils, 10-Color Set w... | 4.260000 | 3 | NO | 0.0 | 12.78 | 5.24 | 12.7800 |
| 7 | CA-2014-167199 | 2014-01-06 | 2014-01-10 | 4 | Standard Class | ME-17320 | Maria Etezadi | Home Office | Greenville | Kentucky | 42420 | South | FUR-CH-10004063 | Furniture | Chairs | Global Deluxe High-Back Manager's Chair | 285.980000 | 9 | NO | 0.0 | 2573.82 | 746.41 | 2573.8200 |
| 8 | CA-2014-167199 | 2014-01-06 | 2014-01-10 | 4 | Standard Class | ME-17320 | Maria Etezadi | Home Office | Los Angeles | Kentucky | 42420 | South | OFF-BI-10004632 | Office Supplies | Binders | Ibico Hi-Tech Manual Binding System | 304.990000 | 2 | NO | 0.0 | 609.98 | 274.49 | 609.9800 |
| 9 | CA-2014-167199 | 2014-01-06 | 2014-01-10 | 4 | Standard Class | ME-17320 | Maria Etezadi | Home Office | Houston | Kentucky | 42420 | South | OFF-AR-10001662 | Office Supplies | Art | Rogers Handheld Barrel Pencil Sharpener | 2.740000 | 2 | NO | 0.0 | 5.48 | 1.48 | 5.4800 |
# Basic info: column types, non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Order ID           9994 non-null   object
 1   Order Date         9994 non-null   datetime64[ns]
 2   Ship Date          9994 non-null   datetime64[ns]
 3   Difference (Days)  9994 non-null   int64
 4   Ship Mode          9994 non-null   object
 5   Customer ID        9994 non-null   object
 6   Customer Name      9994 non-null   object
 7   Segment            9994 non-null   object
 8   City               9994 non-null   object
 9   State              9994 non-null   object
 10  Postal Code        9994 non-null   int64
 11  Region             9994 non-null   object
 12  Product ID         9994 non-null   object
 13  Category           9994 non-null   object
 14  Sub-Category       9994 non-null   object
 15  Product Name       9994 non-null   object
 16  Price Per Unit     9994 non-null   float64
 17  Quantity           9994 non-null   int64
 18  Discounted?        9994 non-null   object
 19  Discount           9994 non-null   float64
 20  Sales              9994 non-null   float64
 21  Profit             9994 non-null   float64
 22  Cost               9994 non-null   float64
dtypes: datetime64[ns](2), float64(5), int64(3), object(13)
memory usage: 1.8+ MB
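The `Difference (Days)` column should be derivable from the two date columns, so it is worth verifying before trusting it in the analysis. Below is a minimal sketch of that check on a hypothetical mini-frame (the variable `mini` and its rows are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset
mini = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-03", "2014-01-04"]),
    "Ship Date": pd.to_datetime(["2014-01-07", "2014-01-08"]),
    "Difference (Days)": [4, 4],
})

# The stored column should equal the computed shipping delay row-for-row
computed = (mini["Ship Date"] - mini["Order Date"]).dt.days
assert (computed == mini["Difference (Days)"]).all()
```

Running the same two lines against `df` would flag any rows where the pre-computed Excel column drifted from the actual dates.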
from ydata_profiling import ProfileReport
report = ProfileReport(df)
report
______________________________________________
(3) Missing Values
# dataset doesn't have missing values
df.isnull().values.any()
False
print("\nMissing Values (True = Missing):")
print(df.isnull().head())
Missing Values (True = Missing):
   Order ID  Order Date  Ship Date  Difference (Days)  Ship Mode  ...  Discount  Sales  Profit   Cost
0     False       False      False              False      False  ...     False  False   False  False
1     False       False      False              False      False  ...     False  False   False  False
2     False       False      False              False      False  ...     False  False   False  False
3     False       False      False              False      False  ...     False  False   False  False
4     False       False      False              False      False  ...     False  False   False  False

[5 rows x 23 columns]
print("[bold #FF0000]Total missing values per column[/bold #FF0000]")
print(f"[bold]{df.isnull().sum()}[/bold]")
Total missing values per column
Order ID             0
Order Date           0
Ship Date            0
Difference (Days)    0
Ship Mode            0
Customer ID          0
Customer Name        0
Segment              0
City                 0
State                0
Postal Code          0
Region               0
Product ID           0
Category             0
Sub-Category         0
Product Name         0
Price Per Unit       0
Quantity             0
Discounted?          0
Discount             0
Sales                0
Profit               0
Cost                 0
dtype: int64
print("\n[bold blue]Total Missing Values in the Dataset:[/bold blue]")
print(f"[bold #FF0000]{df.isnull().sum().sum()} [/bold #FF0000]")
Total Missing Values in the Dataset:
0
______________________________________________
Duplicates
print(f"[bold] {df.duplicated().any()}[/bold]")
False
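The missing-value and duplicate checks above can be bundled into one reusable helper, which is handy when the same audit has to run on every new extract. A minimal sketch on a hypothetical frame (`quality_report` and `demo` are illustrative names, not part of the project code):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values and duplicate rows in one pass."""
    return {
        "total_missing": int(df.isnull().sum().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical frame with one NaN and one fully duplicated row
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
report = quality_report(demo)
print(report["total_missing"], report["duplicate_rows"])  # 1 1
```

Calling `quality_report(df)` on the Superstore frame would reproduce the 0 / False results shown above.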
______________________________________________
(4) Descriptive Analysis of Profit
Overall Profit summary (mean, median, min, max, std).
print(df['Profit'].describe())
count    9994.000000
mean       28.656973
std       234.260203
min     -6599.980000
25%         1.730000
50%         8.665000
75%        29.360000
max      8399.980000
Name: Profit, dtype: float64
Total Profit
print(df['Profit'].sum())
286397.79
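Total profit is more informative alongside a profit margin, i.e. profit relative to the sales that generated it. One sketch of that calculation on hypothetical figures (note the margin is total profit over total sales, not the mean of per-row margins):

```python
import pandas as pd

# Hypothetical sales/profit figures for illustration
demo = pd.DataFrame({"Sales": [100.0, 200.0, 50.0], "Profit": [20.0, 30.0, -5.0]})

# Overall margin: total profit divided by total sales
overall_margin = demo["Profit"].sum() / demo["Sales"].sum()
print(f"Overall profit margin: {overall_margin:.1%}")  # 45 / 350 -> 12.9%
```

Applied to `df`, this would put the 286,397.79 total profit in proportion to total sales.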
______________________________________________
(5) Profit by Category and Sub-Category
Which Categories (Furniture, Technology, Office Supplies) generate more profit?
print(df.groupby('Category')['Profit'].sum().sort_values(ascending=False))
Category
Technology         145455.66
Office Supplies    122490.88
Furniture           18451.25
Name: Profit, dtype: float64
Which Sub-Categories have negative or low profit?
print(df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False))
Sub-Category
Copiers        55617.90
Phones         44516.25
Accessories    41936.78
Paper          34053.34
Binders        30221.64
Chairs         26590.15
Storage        21279.05
Appliances     18138.07
Furnishings    13059.25
Envelopes       6964.10
Art             6527.96
Labels          5546.18
Machines        3384.73
Fasteners        949.53
Supplies       -1188.99
Bookcases      -3472.56
Tables        -17725.59
Name: Profit, dtype: float64
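Rather than scanning the sorted list by eye, the loss-making sub-categories can be isolated with a boolean filter. A minimal sketch on hypothetical rows (the `demo` frame is illustrative):

```python
import pandas as pd

# Hypothetical per-row profit across sub-categories
demo = pd.DataFrame({
    "Sub-Category": ["Tables", "Tables", "Paper", "Bookcases"],
    "Profit": [-120.0, -80.0, 50.0, -10.0],
})

sub_profit = demo.groupby("Sub-Category")["Profit"].sum()
losers = sub_profit[sub_profit < 0].sort_values()  # worst first
print(losers)
```

The same two lines against `df` would return exactly Supplies, Bookcases, and Tables, the three negative totals above.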
______________________________________________
(6) Profit by Region / State / City
Geographic breakdown of profit (top/bottom states/cities)
print(df.groupby('Region')['Profit'].sum().sort_values(ascending=False))
Region
West       108418.79
East        91522.84
South       46749.71
Central     39706.45
Name: Profit, dtype: float64
Identify loss-making locations
print(df.groupby('State')['Profit'].sum().nlargest(10))
State
California    76381.60
New York      74038.64
Washington    33402.70
Michigan      24463.15
Virginia      18598.00
Indiana       18382.97
Georgia       16250.08
Kentucky      11199.70
Minnesota     10823.22
Delaware       9977.37
Name: Profit, dtype: float64
print(df.groupby('State')['Profit'].sum().nsmallest(10))
State
Texas            -25729.29
Ohio             -16971.37
Pennsylvania     -15560.04
Illinois         -12607.89
North Carolina    -7490.81
Colorado          -6527.86
Tennessee         -5341.66
Arizona           -3427.87
Florida           -3399.25
Oregon            -1190.48
Name: Profit, dtype: float64
______________________________________________
(7) Visualization
A) Profit vs. Sales
Relationship between Sales and Profit (sometimes high sales --> low profit due to discounts)
plt.scatter(df['Sales'], df['Profit'])
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Sales vs Profit')
plt.show()
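The "high sales, low profit" pattern in the scatter can be quantified by averaging profit within each discount level. A minimal sketch on hypothetical orders (the `orders` frame is made up; applied to `df` it would show mean profit collapsing as the discount deepens):

```python
import pandas as pd

# Hypothetical orders at different discount levels
orders = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.8],
    "Profit": [10.0, 30.0, 5.0, -5.0, -50.0],
})

# Average profit per discount level: deeper discounts often flip profit negative
by_discount = orders.groupby("Discount")["Profit"].mean()
print(by_discount)
```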
B) Profit by Category
plt.figure(figsize=(10, 6))
category_profit = df.groupby('Category')['Profit'].sum().sort_values(ascending=False)
ax = sns.barplot(x=category_profit.index, y=category_profit.values, palette='viridis')
plt.title('Profit by Category', fontsize=18, fontweight='bold', style='italic', color="#08615A")
plt.xlabel('Category', fontsize=15, fontweight='bold', color="#239406")
plt.ylabel('Profit', fontsize=15, fontweight='bold', color="#360078")
plt.xticks(rotation=20, ha='center')
plt.tight_layout()
palette = sns.color_palette("viridis", n_colors=len(category_profit))
for i, ticklabel in enumerate(ax.get_xticklabels()):
ticklabel.set_color(palette[i])
ticklabel.set_fontweight('bold')
ticklabel.set_fontstyle('italic')
plt.show()
C) Profit by Region
plt.figure(figsize=(10, 6))
region_profit = df.groupby('Region')['Profit'].sum().sort_values(ascending=False)
ax = sns.barplot(x=region_profit.index, y=region_profit.values, palette='coolwarm')
plt.title('Profit by Region', fontsize=18, fontweight='bold', style='italic', color="#590057")
plt.xlabel('Region', fontsize=15, fontweight='bold', color="#084298")
plt.ylabel('Profit', fontsize=15, fontweight='bold', color="#198754")
plt.tight_layout()
plt.show()
D) Profit by Sub-Category
import plotly.express as px
sub_profit = df.groupby('Sub-Category')['Profit'].sum().reset_index().sort_values('Profit', ascending=False)
fig = px.bar(
sub_profit,
x='Sub-Category',
y='Profit',
text='Profit',
title="Profit by Sub-Category",
color='Profit',
color_continuous_scale='Blues'
)
fig.update_traces(
texttemplate='%{text:,.0f}',
textposition='outside',
hovertemplate='<b>%{x}</b><br>Profit: %{y:,.0f}'
)
fig.update_layout(
title="Profit by Sub-Category",
width=900,
height=500,
title_x=0.3,
title_font=dict(size=20, color='#070047',style='italic'),
xaxis_tickangle=-45
)
fig.show()
E) Profit vs Discount
fig = px.scatter(
df,
x="Discount",
y="Profit",
color="Category",
size="Sales",
title="<b><i>Discount Impact on Profit</i></b>",
hover_data=['Sub-Category','Product Name']
)
fig.update_traces(marker=dict(opacity=0.7, line=dict(width=1, color='DarkSlateGrey')))
fig.update_layout(title_x=0.3, title_font=dict(size=20, color='#004085',style='italic'))
fig.show()
F) Profit by Segment
segment_profit = df.groupby('Segment')['Profit'].sum()
fig = px.pie(
names=segment_profit.index,
values=segment_profit.values,
title="<b>Profit Distribution by Customer Segment</b>",
hole=0.3,
color_discrete_sequence=px.colors.qualitative.Dark24_r
)
fig.update_traces(
textposition='inside',
textinfo='percent+label',
hovertemplate='Segment:<b> %{label}</b><br>Profit: <b>%{value:,.0f}</b>',
insidetextfont=dict(size=14)
)
fig.update_layout(
title=dict(
x=0.5,
font=dict(size=20)
),
legend=dict(
orientation="h",
y=-0.1,
x=0.35
)
)
fig.show()
G) Profit Trend Over Time
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['YearMonth'] = df['Order Date'].dt.to_period('M').astype(str)
profit_trend = df.groupby('YearMonth')['Profit'].sum().reset_index()
fig = px.line(
profit_trend,
x='YearMonth',
y='Profit',
title="Profit Trend Over Time",
markers=True
)
fig.update_traces(line=dict(color='#0077b6', width=3))
fig.update_layout(
title=dict(
x=0.25,
font=dict(size=20, color='#003049')
)
)
fig.show()
______________________________________________
(8) Calculate RFM Metrics
Define Recency, Frequency, and Monetary value for each customer.
latest_date = df['Order Date'].max() + pd.Timedelta(days=1)
rfm = df.groupby('Customer ID').agg({
'Order Date': lambda x: (latest_date - x.max()).days, # Recency
'Order ID': 'nunique', # Frequency
'Sales': 'sum' # Monetary
})
rfm.columns = ['Recency', 'Frequency', 'Monetary']
rfm.head()
| Recency | Frequency | Monetary | |
|---|---|---|---|
| Customer ID | |||
| AA-10315 | 185 | 5 | 5563.56 |
| AA-10375 | 20 | 9 | 1056.39 |
| AA-10480 | 260 | 4 | 1790.51 |
| AA-10645 | 56 | 6 | 5086.94 |
| AB-10015 | 416 | 3 | 886.15 |
🤠 Loyal customers
(low Recency, high Frequency, high Monetary)
😮‍💨 Churned customers
(high Recency, low Frequency, low Monetary)
🤑 Big spenders
(low Monetary --> low interest, high Monetary --> VIP)
rfm_classified = rfm.copy()
def classify_customer(row):
    labels = []
    if row['Recency'] < 30 and row['Frequency'] > 5 and row['Monetary'] > 2000:
        labels.append("Loyal")
    if row['Recency'] > 100 and row['Frequency'] <= 2 and row['Monetary'] < 5000:
        labels.append("Churned")
    if row['Monetary'] > 10000:
        labels.append("Big Spenders")
    if row['Recency'] < 60:
        labels.append("Active")
    if row['Frequency'] > 3:
        labels.append("Regular Buyer")
    if row['Monetary'] > 1000:
        labels.append("Good Spender")
    return " & ".join(labels) if labels else "Unclassified"
rfm_classified["Classification"] = rfm_classified.apply(classify_customer, axis=1)
rfm_classified.head(10)
| Recency | Frequency | Monetary | Classification | |
|---|---|---|---|---|
| Customer ID | ||||
| AA-10315 | 185 | 5 | 5563.56 | Regular Buyer & Good Spender |
| AA-10375 | 20 | 9 | 1056.39 | Active & Regular Buyer & Good Spender |
| AA-10480 | 260 | 4 | 1790.51 | Regular Buyer & Good Spender |
| AA-10645 | 56 | 6 | 5086.94 | Active & Regular Buyer & Good Spender |
| AB-10015 | 416 | 3 | 886.15 | Unclassified |
| AB-10060 | 55 | 8 | 7755.63 | Active & Regular Buyer & Good Spender |
| AB-10105 | 42 | 10 | 14473.57 | Big Spenders & Active & Regular Buyer & Good S... |
| AB-10150 | 42 | 5 | 966.71 | Active & Regular Buyer |
| AB-10165 | 26 | 8 | 1113.85 | Active & Regular Buyer & Good Spender |
| AB-10255 | 167 | 9 | 914.53 | Regular Buyer |
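Since each customer can carry several `&`-joined labels, tallying how often each individual label occurs requires splitting the strings first. A minimal sketch on hypothetical classification strings like those in the table above:

```python
import pandas as pd

# Hypothetical classification strings of the kind produced by classify_customer
labels = pd.Series([
    "Active & Regular Buyer & Good Spender",
    "Regular Buyer & Good Spender",
    "Unclassified",
])

# Split each string into its labels, flatten, and count occurrences
counts = labels.str.split(" & ").explode().value_counts()
print(counts)
```

Run on `rfm_classified["Classification"]`, this would show, for example, how many customers qualify as "Big Spenders" overall.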
______________________________________________
(9) Assign R, F, M Scores
Use quantiles to give each metric a score from 1 (lowest) to 4 (highest)
# Create quantile-based scores (1 = lowest, 4 = highest); rank() breaks Frequency ties
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1,2,3,4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1,2,3,4])
# Combine the three scores into a single RFM code (e.g. "441"), used for segmentation below
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
rfm.head()
| Recency | Frequency | Monetary | R_Score | F_Score | M_Score | RFM_Score | |
|---|---|---|---|---|---|---|---|
| Customer ID | |||||||
| AA-10315 | 185 | 5 | 5563.56 | 1 | 1 | 4 | 114 |
| AA-10375 | 20 | 9 | 1056.39 | 4 | 4 | 1 | 441 |
| AA-10480 | 260 | 4 | 1790.51 | 1 | 1 | 2 | 112 |
| AA-10645 | 56 | 6 | 5086.94 | 3 | 2 | 4 | 324 |
| AB-10015 | 416 | 3 | 886.15 | 1 | 1 | 1 | 111 |
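The `rank(method='first')` step on Frequency is not cosmetic: order counts are heavily tied, and `pd.qcut` on tied data can fail because the quartile edges are no longer unique. A minimal sketch demonstrating both behaviors on a hypothetical tied series:

```python
import pandas as pd

freq = pd.Series([1, 1, 1, 1, 2, 2, 3, 8])  # heavily tied, as order counts often are

# Direct qcut on tied data raises "Bin edges must be unique"
try:
    pd.qcut(freq, 4, labels=[1, 2, 3, 4])
except ValueError as e:
    print("qcut failed:", e)

# Ranking first gives every row a distinct position, so quartiles always form
scores = pd.qcut(freq.rank(method="first"), 4, labels=[1, 2, 3, 4])
print(scores.value_counts().sort_index())  # two rows per quartile
```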
______________________________________________
(10) Customer Segmentation
Classify customers based on their RFM score patterns
def classify_rfm(rfm_score):
    recency_map = {
        '4': "Loyal",
        '3': "Engaged",
        '2': "At Risk",
        '1': "Churned"
    }
    frequency_map = {
        '4': "Frequent",
        '3': "Occasional",
        '2': "Seldom",
        '1': "Rare"
    }
    monetary_map = {
        '4': "VIP",
        '3': "Premium",
        '2': "Average",
        '1': "Budget"
    }
    r_label = recency_map.get(rfm_score[0], "Unknown")
    f_label = frequency_map.get(rfm_score[1], "Unknown")
    m_label = monetary_map.get(rfm_score[2], "Unknown")
    return f"{r_label} {f_label} {m_label}"
# Apply to DataFrame
rfm['Segment'] = rfm['RFM_Score'].apply(classify_rfm)
rfm.head()
| Recency | Frequency | Monetary | R_Score | F_Score | M_Score | RFM_Score | Segment | |
|---|---|---|---|---|---|---|---|---|
| Customer ID | ||||||||
| AA-10315 | 185 | 5 | 5563.56 | 1 | 1 | 4 | 114 | Churned Rare VIP |
| AA-10375 | 20 | 9 | 1056.39 | 4 | 4 | 1 | 441 | Loyal Frequent Budget |
| AA-10480 | 260 | 4 | 1790.51 | 1 | 1 | 2 | 112 | Churned Rare Average |
| AA-10645 | 56 | 6 | 5086.94 | 3 | 2 | 4 | 324 | Engaged Seldom VIP |
| AB-10015 | 416 | 3 | 886.15 | 1 | 1 | 1 | 111 | Churned Rare Budget |
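A natural next step is summarizing how customers distribute across these segments, as counts and shares. A minimal sketch on a hypothetical `demo_rfm` frame with segment labels like those above:

```python
import pandas as pd

# Hypothetical segment labels of the kind generated by classify_rfm
demo_rfm = pd.DataFrame({"Segment": [
    "Churned Rare VIP",
    "Loyal Frequent Budget",
    "Churned Rare Budget",
    "Churned Rare Budget",
]})

# Counts and proportional shares per segment, side by side
counts = demo_rfm["Segment"].value_counts()
shares = demo_rfm["Segment"].value_counts(normalize=True).round(2)
print(pd.concat([counts, shares], axis=1, keys=["count", "share"]))
```

Applied to `rfm["Segment"]`, this table would show which RFM segments dominate the customer base and therefore where retention or upsell effort should go.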